feat: add Vietnamese language support #1013

Merged: thatjuan merged 3 commits into oramasearch:main from xirothedev:feat/add-vietnamese-language on Jun 26, 2026

@xirothedev (Contributor) commented Feb 11, 2026

Summary

Adds Vietnamese (vi) language support to Orama, including:

  • Stemmer (packages/stemmers/lib/vi.js): Identity function — Vietnamese is an isolating (analytic) language where words do not inflect, so no morphological stemming is needed
  • Stopwords (packages/stopwords/lib/vi.js): Common Vietnamese function words (conjunctions, prepositions, particles, pronouns, etc.)
  • Splitter regex in languages.ts: Covers all Vietnamese diacritics (ă, â, ê, ô, ơ, ư, đ and all tone marks)
  • Build scripts: Updated both stemmers and stopwords build scripts
  • Tests: Added tokenizer test case for Vietnamese
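
To make the splitter bullet concrete, here is a minimal sketch of a Vietnamese-aware splitter. The regex, the `VI_SPLITTER` name, and the `splitVietnamese` helper are assumptions written for illustration; they are not the pattern actually added to languages.ts. The character ranges cover precomposed (NFC) Vietnamese letters: the Latin-1 accented vowels, ă, đ, ĩ, ũ, ơ, ư, and the Latin Extended Additional block that holds the tone-marked forms.

```typescript
// Hypothetical splitter: matches runs of characters that are NOT part of a
// Vietnamese token. Allowed characters are ASCII letters and digits, "_", "'",
// "-", and the precomposed Vietnamese letters listed below.
const VI_SPLITTER =
  /[^A-Za-z0-9_'\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0102\u0103\u0110\u0111\u0128\u0129\u0168\u0169\u01A0\u01A1\u01AF\u01B0\u1EA0-\u1EF9-]+/g;

// Illustrative helper (not an Orama API): normalize to NFC so that decomposed
// input (base letter + combining marks) matches the precomposed ranges,
// lowercase, split, and drop empty fragments.
function splitVietnamese(text: string): string[] {
  return text
    .normalize("NFC")
    .toLowerCase()
    .split(VI_SPLITTER)
    .filter((token) => token.length > 0);
}

console.log(splitVietnamese("Lập trình máy tính, tài liệu!"));
```

Normalizing to NFC first is a design choice of this sketch: without it, a decomposed "à" (a followed by U+0300) would be split at the combining mark.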

Why Vietnamese?

Vietnamese is spoken by ~100 million native speakers and is one of the most widely used languages in Southeast Asia. Adding support enables Orama users to build search for Vietnamese content.

Technical Notes

Vietnamese is an isolating language — unlike European languages, words do not change form (no conjugation, declension, or inflection). This means:

  • The stemmer is intentionally an identity function (return word)
  • Tokenization relies on the splitter regex and stopwords filtering
  • Vietnamese separates syllables with spaces, so the default space-based tokenization works without a dedicated word segmenter; multi-syllable words are indexed as individual syllable tokens
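
The notes above can be sketched as a tiny pipeline: split on whitespace, drop stopwords, then apply the identity stemmer. The `stemmer`, `stopWords`, and `tokenize` names are hypothetical, and the stopword set is a small illustrative subset rather than the list shipped in packages/stopwords/lib/vi.js.

```typescript
type Stemmer = (word: string) => string;

// Identity stemmer: Vietnamese words do not inflect, so the word is returned as is.
const stemmer: Stemmer = (word) => word;

// Illustrative subset of function words (assumption, not the PR's actual list).
const stopWords = new Set(
  ["và", "là", "của", "có", "một"].map((w) => w.normalize("NFC"))
);

// Hypothetical tokenizer: whitespace split, stopword filter, identity stemming.
function tokenize(text: string): string[] {
  return text
    .normalize("NFC")
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0 && !stopWords.has(token))
    .map(stemmer);
}

console.log(tokenize("Tôi là một lập trình viên"));
// -> ["tôi", "lập", "trình", "viên"]
```

Because spaces separate syllables, "lập trình viên" (programmer) comes out as three tokens; the stemmer leaves each of them untouched.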

Test plan

  • Verify pnpm test passes for the tokenizer tests
  • Verify build scripts generate correct dist output for Vietnamese

xirothedev and others added 3 commits February 12, 2026 00:27
Vietnamese is an isolating (analytic) language where words do not
inflect, so the stemmer is an identity function. Adds:
- Vietnamese stemmer (identity function)
- Vietnamese stopwords list
- Vietnamese splitter regex with full diacritics support
- Tokenizer test for Vietnamese

Vietnamese diacritics encode distinct vowels and tones, so folding them to
ASCII changes word meaning. normalizeToken applied replaceDiacritics to every
language, which partially stripped Vietnamese tokens (e.g. "tài" -> "tai",
"trình" -> "trinh") while leaving Latin Extended Additional characters intact,
producing inconsistent, lossy tokens.

Skip replaceDiacritics for languages whose diacritics are significant, tracked
by a new LANGUAGES_WITH_SIGNIFICANT_DIACRITICS set (currently Vietnamese). This
fixes the new Vietnamese tokenizer test, which expects full diacritic
preservation.
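
A minimal sketch of the guard this commit describes, assuming the language key is "vietnamese" and using a simplified stand-in for `replaceDiacritics`. Unlike the lookup the commit message describes, which stripped Vietnamese tokens only partially, the stand-in strips every combining mark via NFD decomposition. The real `normalizeToken` in Orama also takes other arguments and performs other steps that are omitted here.

```typescript
// Languages whose diacritics carry meaning and must not be folded to ASCII.
const LANGUAGES_WITH_SIGNIFICANT_DIACRITICS = new Set<string>(["vietnamese"]);

// Simplified stand-in (assumption): decompose, then drop combining marks.
function replaceDiacritics(token: string): string {
  return token.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical, reduced signature: only the diacritics decision is shown.
function normalizeToken(token: string, language: string): string {
  if (LANGUAGES_WITH_SIGNIFICANT_DIACRITICS.has(language)) {
    return token; // keep "tài" distinct from "tai"
  }
  return replaceDiacritics(token);
}

console.log(normalizeToken("tài", "vietnamese")); // diacritics preserved
console.log(normalizeToken("tài", "english")); // folded to "tai"
```
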
…exports

- Drop dead underscore-joined compound stopwords (có_thể, một_cách, ngay_cả):
  the tokenizer splits on whitespace, so these never match real input, and
  their head words (có, một, ngay) are already individual stopwords.
- Commit the generated @orama/stemmers/vietnamese and @orama/stopwords/vietnamese
  package.json exports (produced by `pnpm build`), matching the existing
  sanskrit convention so the published packages expose the subpaths.
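
Why the underscore-joined entries were dead can be shown with a short sketch. The stopword set and the `matchedStopWords` helper are hypothetical and exist only for this illustration: after a whitespace split, the input "có thể" produces the tokens "có" and "thể", so an entry such as "có_thể" is never looked up successfully.

```typescript
// Illustrative set containing one compound entry of the kind the commit removed.
const stopWordsWithCompound = new Set(["có", "một", "ngay", "có_thể"]);

// Hypothetical helper: report which stopword entries a text actually hits.
function matchedStopWords(text: string): string[] {
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0 && stopWordsWithCompound.has(token));
}

console.log(matchedStopWords("tôi có thể giúp"));
// -> ["có"]; "có_thể" is never matched
```
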
@thatjuan thatjuan merged commit 6af0f8b into oramasearch:main Jun 26, 2026
1 check passed